DNS Detection Pipeline

Edward Crowder & Ahmad Chaiban

This section contains the implementation of our project.

Data Source:

Importing the data and Preprocessing

The dataset used for this implementation includes DoH protocol captures of Benign-DoH and Malicious-DoH traffic. The browsers and tools used to capture this traffic include Google Chrome, Mozilla Firefox, dns2tcp, DNSCat2, and Iodine, while the servers used to respond to DoH requests are AdGuard, Cloudflare, Google DNS, and Quad9.

At this stage the data was imported and preprocessed to set up for the various analyses, feature-selection steps, and classification techniques (both supervised and unsupervised learning).
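A minimal sketch of this import-and-preprocess step is shown below. The column names and the tiny in-memory stand-ins for the benign and malicious captures are illustrative assumptions; in practice the two captures would be read from the dataset's CSV files.

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for the Benign-DoH and Malicious-DoH captures
# (in the real pipeline these would come from pd.read_csv on the dataset files).
benign = pd.DataFrame({"Duration": [1.2, 3.4], "FlowBytesSent": [100, 250]})
malicious = pd.DataFrame({"Duration": [9.8, 7.6], "FlowBytesSent": [5000, 4200]})

benign["Label"] = 0       # Benign-DoH
malicious["Label"] = 1    # Malicious-DoH

# Merge the two captures and clean rows that break downstream models.
df = pd.concat([benign, malicious], ignore_index=True)
df = df.replace([np.inf, -np.inf], np.nan).dropna()
print(df.shape)  # (4, 3)
```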

Feature list

Merging Data

Unsupervised Learning

t-SNE 2D & 3D

After the initial preprocessing, t-SNE was applied to the data set in order to visualize the features.
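The 2D projection can be sketched as follows; the synthetic clusters here merely stand in for the benign and malicious flows, and the perplexity value is an assumption, not the one used in the notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic clusters standing in for benign / malicious flows.
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(4, 1, (50, 5))])

# Project to 2 dimensions for plotting; perplexity must be < n_samples.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2)
```

Passing `n_components=3` instead yields the 3D embedding used for the second visualization.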

Undersampling

Undersampling the dataset is a vital step. Not only is the data slightly imbalanced, but a method such as cluster centroids allows for outlier elimination as well as clearer visualization with unsupervised learning methods such as t-SNE.
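The cluster-centroids idea can be sketched with plain scikit-learn: the majority class is replaced by the centroids of k clusters, with k equal to the minority-class size. This is a hand-rolled illustration of the technique, not the exact library call used in the pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Imbalanced toy data: 200 majority (benign) vs 50 minority (malicious) flows.
X_maj = rng.normal(0, 1, (200, 4))
X_min = rng.normal(3, 1, (50, 4))

# Cluster-centroid undersampling: summarize the majority class by the
# centroids of k clusters, where k equals the minority-class size.
k = len(X_min)
centroids = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_maj).cluster_centers_

X_bal = np.vstack([centroids, X_min])
y_bal = np.array([0] * k + [1] * k)
print(X_bal.shape)  # (100, 4)
```

Because each centroid averages several nearby points, isolated outliers are absorbed, which is the outlier-elimination effect noted above.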

t-SNE (After undersampling)

After undersampling, t-SNE was applied once more, and yielded both 2D and 3D visualizations that showed some clear separation between the benign and malicious data.

Normalize

An important step for improving classification performance is to normalize the data.
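A minimal normalization sketch, assuming min-max scaling to [0, 1] (the specific scaler is an assumption):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix with very different column scales.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Fit the scaler on training data only, then reuse it on validation/test data.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```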

Feature selection

To obtain the best results from the fewest and most descriptive features, the Pearson correlation matrix was used for feature selection.
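A common way to apply this is to drop one feature from each highly correlated pair. The feature names and the 0.9 threshold below are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = rng.normal(size=200)
df = pd.DataFrame({
    "FlowBytesSent": a,
    "FlowBytesReceived": a * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate
    "Duration": rng.normal(size=200),
})

# Pearson correlation matrix; keep only the upper triangle so each pair
# is inspected once, then drop one feature from every pair above threshold.
corr = df.corr(method="pearson").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
print(to_drop)  # ['FlowBytesReceived']
```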

Pearson Correlation Matrix

Chi Squared

Features To Keep

Based on the correlation matrix, the following features were selected to remain in the training dataset.

t-SNE (After undersampling and feature selection)

Classifiers

Three classifiers were trained on the data: a Support Vector Machine, a Random Forest Classifier, and an LSTM neural network.

Train-test-validation-split

The train-test-validation split selected for training the three classifiers is as follows:
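One way to produce such a three-way split is two chained calls to `train_test_split`. The 70/15/15 ratios below are an assumption for illustration, not necessarily the ratios used here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# Assumed 70/15/15 split: carve off 30%, then halve it into
# validation and test sets, stratifying to preserve class balance.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```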

Support Vector Machine

Acceptable results

The following hyperparameters led to promising results:

from sklearn import svm

# RBF-kernel SVMs with increasing gamma; trailing comment = accuracy obtained.
svm_model = svm.SVC(C=1.0, gamma=1e4, kernel='rbf')   ## 0.93
svm_model = svm.SVC(C=1.0, gamma=1e5, kernel='rbf')   ## 0.82
svm_model = svm.SVC(C=1.0, gamma=1e6, kernel='rbf')   ## 0.76
svm_model = svm.SVC(C=30.0, gamma=1e6, kernel='rbf')  ## 0.76

Plotting SVM training curve

Random Forest Classifier

Plotting RFC Training curves

LSTM Neural Network

Neural Network 2

Classifier Evaluation

Recall: SVM, RFC, LSTM & ANN

Confusion Matrices

ROC (Receiver Operating Characteristic) Curves
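The evaluation metrics named above can be computed per classifier roughly as follows; the toy labels and scores are purely illustrative stand-ins for one model's predictions.

```python
import numpy as np
from sklearn.metrics import recall_score, confusion_matrix, roc_auc_score

# Toy ground truth plus one classifier's hard predictions and scores.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.9, 0.4, 0.8, 0.3, 0.7, 0.6])

# Recall: fraction of malicious flows actually caught, TP / (TP + FN).
rec = recall_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows = truth, columns = prediction
auc = roc_auc_score(y_true, y_score)    # area under the ROC curve
print(round(rec, 2), cm.ravel().tolist())  # 0.75 [3, 1, 1, 3]
```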

Testing on the Original Unsampled Data

SVM

RFC

LSTM

ANN

Confusion matrices for whole dataset

Prevention

The following basic rules serve as a good complement to the ML pipeline.